
Computer Engineering

   

Optimizing Exploration via Q-Value Underestimation in Multi-Agent Reinforcement Learning

  

• Published: 2026-01-27


Abstract: Value decomposition methods are widely used in multi-agent reinforcement learning, but bias propagated through bootstrapping and maximization often causes Q-values to be overestimated, trapping agents in suboptimal policies and producing large swings between success and failure during training. Traditional exploration strategies struggle with this problem because they cannot guide agents out of suboptimal policies. To address it, we propose Quest, a method that dynamically underestimates Q-values to break the equilibrium of suboptimal convergence during training and uses an asymmetric bias to steer agents toward more effective policy search. The key contribution of this paper is to move beyond the conventional approach of improving performance by refining the Q-network decomposition: we introduce an external intervention mechanism that dynamically guides the agent's exploration, bypassing the bottleneck of complex decomposition structures and effectively improving performance. We evaluate Quest on StarCraft II multi-agent scenarios, where, as measured by the Robustness Against Suboptimal Convergence (RASC) metric, it achieves an average 130% improvement in robustness against suboptimal convergence and an average 190% increase in win rate in complex scenarios such as 6h_vs_8z. These results show that Quest improves exploration, training stability, and final performance in complex multi-agent environments.

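The abstract leaves the internals of Quest unspecified, so the snippet below is only a minimal sketch of the general idea it describes: injecting a dynamic, downward (asymmetric) bias into the bootstrapped target of a decomposed Q-learning update so that training is pushed away from a suboptimally converged joint policy. The VDN-style additive decomposition, the linearly decaying underestimation_coeff schedule, and all function names are assumptions made for illustration; they are not the paper's implementation.

"""
Minimal sketch of underestimation-biased TD targets, assuming a VDN-style
additive value decomposition. The exact Quest mechanism is not described in
the abstract; the penalty schedule and names below are illustrative only.
"""
import numpy as np

rng = np.random.default_rng(0)

n_agents, n_actions, n_states = 2, 3, 5
gamma = 0.99
alpha = 0.1

# Per-agent Q-tables; the joint value is their sum (VDN-style decomposition).
Q = [np.zeros((n_states, n_actions)) for _ in range(n_agents)]

def joint_max_q(next_state):
    """Standard bootstrapped target term: each agent maximizes independently,
    which is the source of the upward (overestimation) bias."""
    return sum(Q[i][next_state].max() for i in range(n_agents))

def underestimation_coeff(step, total_steps, c0=1.0):
    """Hypothetical schedule: start with a strong downward bias and decay it,
    so early training is biased away from prematurely 'good-looking' actions."""
    return c0 * max(0.0, 1.0 - step / total_steps)

def td_target_with_underestimation(reward, next_state, step, total_steps):
    """Asymmetric target: the bootstrapped term is penalized (never inflated),
    breaking the equilibrium around a suboptimally converged joint policy."""
    bootstrap = joint_max_q(next_state)
    penalty = underestimation_coeff(step, total_steps)
    return reward + gamma * (bootstrap - penalty)

# Toy update loop on random transitions, only to show where the biased
# target plugs into an otherwise standard decomposed Q-learning update.
total_steps = 1000
for step in range(total_steps):
    s = rng.integers(n_states)
    actions = [rng.integers(n_actions) for _ in range(n_agents)]
    s_next = rng.integers(n_states)
    reward = float(rng.normal())

    target = td_target_with_underestimation(reward, s_next, step, total_steps)
    q_joint = sum(Q[i][s, actions[i]] for i in range(n_agents))
    td_error = target - q_joint
    for i in range(n_agents):
        Q[i][s, actions[i]] += alpha * td_error / n_agents

The point the sketch tries to convey is that the intervention is external to the decomposition: the per-agent value functions and the mixing structure are left untouched, and only the target they regress toward is shifted downward by a dynamically scheduled amount.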